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Abstract 

The computational complexity of different steps of the basic SSA is 
discussed. It is shown that the use of the general-purpose "blackbox" 
routines (e.g. found in packages like LAPACK) leads to huge waste of time 
resources since the special Hankel structure of the trajectory matrix is not 
taken into account. 

We outline several state-of-the-art algorithms (for example, Lanczos- 
based truncated SVD) which can be modified to exploit the structure of 
the trajectory matrix. The key components here are Hankel matrix- vector 
multiplication and the Hankelization operator. We show that both can 
be computed efficiently by the means of Fast Fourier Transform. 

The use of these methods yields the reduction of the worst-case compu- 
tational complexity from 0(N 3 ) to 0(kN log N), where N is series length 
and k is the number of eigentriples desired. 

1 Introduction 

Despite the recent growth of SSA-related techniques it seems that little attention 
was paid to their efficient implementations, e.g. the singular value decomposi- 
tion of Hankel trajectory matrix, which is a dominating time-consuming step of 
the basic SSA. 

In almost all SSA-related applications only a few leading eigentriples are 
desired, thus the use of general-purpose "blackbox" SVD implementations causes 
a huge waste of time and memory resources. This almost prevents the use of the 
optimum window sizes even on moderate length (say, few thousand) time series. 
The problem is much more severe for 2D-SSA [TS] or subspace-based methods 
|5J |SJ [T5] , where large window sizes are typical. 

We show that proper use of state-of-art singular value decomposition al- 
gorithms can significantly reduce the amount of operations needed to compute 
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truncated SVD suitable for SSA purposes. This is possible because the informa- 
tion about the leading singular triples becomes available long before the costly 
full decomposition. 

These algorithms are usually based on Lanczos iterations or related methods. 
This algorithm can be modified to exploit the Hankel structure of the data 
matrix effectively. Thus the computational burden of SVD can be considerably 
reduced. 

In section [2] we will outline the computational and storage complexity of 
the steps of the basic SSA algorithm. Section [3] contains a review of well- 
known results connected with the Lanczos singular value decompostion algo- 
rithm. Generic implementation issues in finite-precision arithmetics will be 
discussed as well. The derivation of the efficient Hankel SSA trajectory matrix- 
vector multiplication method will be presented in section|4] This method is a key 
component of the fast truncated SVD for the Hankel SSA trajectory matrices. 
In the section [5] we will exploit the special structure of the rank 1 Hankelization 
operator which allows its efficient implementation. Finally, all the implementa- 
tions of the mentioned algorithms are compared in sections [6] and [7] on artificial 
and real data. 



2 Complexity of the Basic SSA Algorithm 

Let N > 2. Denote by F = (/i,...,/jv) a real- valued time series of length 
N. Fix the window length L, 1 < L < N. The Basic SSA algorithm for the 
decomposition of a time series consists of four steps [14]. We will consider each 
step separately and discuss its computational and storage complexity. 



2.1 Embedding 



The embedding step maps original time series to a sequence of lagged vectors. 
The embedding procedure forms K — N — L + 1 lagged vectors Xi € M. L : 

Xi = CA,---,/*+l-i) T , Ki^K. 
These vectors form the L-trajectory (or trajectory) matrix X of the series F: 



X = 



(h 

h 



h 
h 



fn-i 
Sk 



L+l 



N-l 



Ik \ 

Jk+i 



In J 



Note that this matrix has equal values along the antidiagonals, thus it is a Hankel 
matrix. The computational complexity of this step is negligible: it consists of 
simple data movement which is usually considered a cheap procedure. However, 
if the matrix is stored as-is, then this step obviously has storage complexity of 
O(LK) with the worst case being 0(N 2 ) when K ~ L ~ N/2. 
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2.2 Singular Value Decomposition 



This step performs the singular value decomposition of the L x K trajectory 
matrix X. Without loss of generality we will assume that L ^ K. Then the 
SVD can be written as 



where U = (ui, . . . , Ul) is a L x L orthogonal matrix, V — (vi, . . . , Vk) is a 
KxK orthogonal matrix, and E is a Lx K diagonal matrix with nonnegative real 
diagonal entries E^ = c^, for i = 1, . . . , L. The vectors Uj are called Ze/t singular 
vectors, the «j are the right singular vectors, and the Uj are the singular values. 
We will label the singular values in descending order: o~\ ^ o~% ^ • ■ • ^ erj,. Note 
that the singular value decomposition of X can be represented in the form 



where Xj = a^vf . The matrices Xj have rank 1, therefore we will call them 
elementary matrices. The triple (o~i,Ui,Vi) will be called a singular triple. 

Usually the SVD is computed by means of Golub and Reinsh algorithm [12 
which requires 0(L 2 K + LK 2 + K 3 ) multiplications having the worst compu- 
tational complexity of 0(N 3 ) when K ~ L ~ AT/2. Note that all the data must 
be processed at once and SVDs even of moderate length time series (say, when 
N > 5000) are essentially unfeasible. 

2.3 Grouping 

The grouping procedure partitions the set of indices {1, . . . , L} into m disjoint 
sets 1%, . . . , I m . For the index set / = . . . , i p } one can define the resultant 
matrix Xj = + • • • + Xi p . Such matrices are computed for I = I\, . . . ,I m 
and we obtain the decomposition 



For the sake of simplicity we will assume that m — L and each Ij = {j} later 
on. The computation of each elementary matrix Xj costs O(LK) multiplications 
and 0(LK) storage yielding overall computation complexity of 0(L 2 K) and 
0(L 2 K) for storage. Again the worst case is achieved at L ~ K ~ N/2 with 
the value 0(N 3 ) for both computation and storage. 

2.4 Diagonal Averaging (Hankelization) 

This is the last step of the Basic SSA algorithm transforming each grouped 
matrix Xi j into a new time series of length N. This is performed by means 
of diagonal averaging procedure (or Hankelization). It transforms an arbitrary 
L x K matrix Y to the series g±, . . . , by the formula: 



X = U T YV, 



X = X l + ---+X L , 



X = X h +--- + Xi, 




L < k < K, 
K ^k^N. 
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Note that these two steps are usually combined. One just forms the elemen- 
tary matrices one by one and immediately applies the Hankelization operator. 
This trick reduces the storage requirements considerably. 

O(N) multiplications and O(LK) additions are required for performing the 
diagonal averaging of one elementary matrix. Summing these values for L el- 
ementary matrices we obtain the overall computation complexity for this step 
being 0(L 2 K) additions and O(NL) multiplications. The worst case complexity 
will be then 0(N 3 ) additions and 0(N 2 ) multiplications 

Summing the complexity values alltogether for all four steps we obtain the 
worst case computational complexity to be 0(N 3 ). The worst case occurs ex- 
actly when window length L equals N/2 (the optimal window length for asymp- 
totic separability [H]). 

3 Truncated Singular Value Decomposition 

The typical approach to the problem of computating the singular triples 
(ai,Ui,Vi) of A is to use the Schur decomposition of some matrices related to 
A, namely 

1. the cross product matrices AA T , A T A, or 

2. the cyclic matrix C = ( ?r ^). 




One can prove the following theorem (see, e.g. [Ill fT5] ): 

Theorem 1 Let A be an L x K matrix and assume without loss of generality 
that K ^ L. Let the singular value decomposition of A be 




U T AV = &mg (a 1 ,...,a L ). 



Then 



V T (A T A) V 



diag(er^ . . . 



,*l,Q,...,0) 



K-L 



U T (AA T ) U 



diag (a 2 ,... 




Moreover, if V is partitioned as 



V = (Vi,V 2 ), Vi eR KxL 1 V 2 eR Kx{K - L) , 
then the orthonormal columns of the (L + K) x (L + K) matrix 




form an eigenvector basis for the cyclic matrix C and 

Y T CY = diag(tJi, . . . ,a L , -a 1: . . . , -o L ,Q, . . . ,0). 



K-L 
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The theorem tells us that we can obtain the singular values and vectors of 
A by computing the eigenvalues and corresponding eigenvectors of one of the 
symmetric matrices. This fact forms the basis of any SVD algorithm. 

In order to find all eigenvalues and eigenvectors of the real symmetric matrix 
one usually turns it into a similar tridiagonal matrix by the means of orthogo- 
nal transformations (Householder rotations or alternatively via Lanczos recur- 
renses). Then, for example, another series of orthogonal transformations can be 
applied to the latter tridiagonal matrix which converges to a diagonal matrix 

uu. 

Such approach, which in its original form is due to Golub and Reinsh |12| 
and is used in e.g. LAPACK [1 , computes the SVD by implicitly applying the 
QR algorithm to the symmetric eigenvalue problem for A 1 A. 

For large and sparse matrices the Golub-Reinsh algorithm is impractical. 
The algorithm itself starts out by applying a series of transformations directly 
to matrix A to reduce it to the special bidiagonal form. Therefore it requires the 
matrix to be stored explicitlly, which may be impossible simply due to its size. 
Moreover, it may be difficult to take advantage of any structure (e.g. Hankel in 
SSA case) or sparsity in A since it is quickly destroyed by the transformations 
applied to the matrix. The same argument applies to SVD computation via the 
Jacobi algorithm. 

Bidiagonalization was proposed by Golub and Kahan |10| as a way of tridi- 
agonalizing the cross product matrix without forming it explicitly. The method 
yields the decomposition 

A = P T BQ, 

where P and Q are orthogonal matrices and B is an L x K lower bidiagonal 
matrix. Here BB T is similar to AA T . Additionally, there are well-known SVD 
algorithms for real bidiagonal matrices, for example, the QR method [11 , divide- 
and-conquer \T6\, and twisted [7] methods. 
The structure of our exposition follows [H5] . 

3.1 Lanczos Bidiagonalization 

For a rectangular L x K matrix A the Lanczos bidiagonalization computes a 
sequence of left Lanczos vectors Uj £ M. L and right Lanczos vectors Vj G M. K 
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and scalars o>j and fa, j = 1, . . . , k as follows: 



Algorithm 1: Golub-Kahan-Lanczos Bidiagonalization 



1 Choose a starting vector po G 1 

a ft < — lbo|| 2 . "i < — Po/^i, «o 
3 for j <— 1 to k do 



4 fj < j4 T 1 

5 ay < — \\rj 

6 < — Av 

7 fa+l 

8 end 



CXiUj 



f>j\\ 2 , u j + 1 i — Pj/fa+1 



Later on we will refer to one iteration of the for loop at line [3] as a Lanczos step 
(or iteration). After k steps we will have the lower bidiagonal matrix 



fa a 2 



B, = 



fa 



V 



a k 
fa+ij 



If all the computations performed are exact, then the Lanczos vectors will be 
orthornomal: 



U k+l = . . . ,u k+1 ) G R Lx ^ k+1 \ ULxUk+x = 



and 



Vfe +1 = . . .,v k+1 ) G M A ' x ( fc+1 ), ^Vfc+i = / fe+ i, 

where Ik is k x k identity matrix. By construction, the columns of Uk+i and 
Vfe+i satisfy the recurrences 



*3 u 3 



fay 



fa + lU-j 



+1 



= Aw,- 



and can be written in the compact form as follows: 

AV k = U k+1 B k , 
A T U k+ i = V k Bl + a k+ iv k+1 el +1 . 

Also, the first equiality can be rewritten as 

Ul +l AV k = B k , 



(1) 
(2) 



explaining the notion of left and right Lanczos vectors. Moreover, u k G 
K k (AA T , Ul ), v k G JC k (A T A,vi), where 



IC k (AA T , Ul ) = {u 1 ,AA T u 1 ,...,(AA T ) k Ul } , 
K k (A T A, Vl ) = A T Avi, . . . , (A T A) fc Wl } , 



(3) 
(4) 



G 



and therefore u%, 112, . . . , Uk and v\,V2, ■ ■ ■ ,Vk form an orthogonal basis for these 
two Krylov subspaces. 

3.2 Lanczos Bidiagonalization in Inexact Arithmetics 

When the Lanczos bidiagonalization is carried out in finite precision arithmetics 
the Lanczos recurrence becomes 

UjVj = A T u j -P j v j - 1 +f j , 
Pj+iUj+i = Avj - ctju-j + gj, 

where fj £ K. , gj £ K L are error vectors accounting for the rounding errors 
at the j-th step. Usually, the rounding terms are small, thus after k steps 
the equations (JT|), Q still hold to almost machine accuracy. In contrast, the 
orthogonality among left and right Lanczos vectors is lost. 

However, the Lanzos algorithm possesses a remarkable feature that the ac- 
curacy of the converged Ritz values is not affected by the loss of orthogonality, 
but while the largest singular values of are still accurate approximations to 
the largest singular values of A, the spectrum of will in addition contain false 
multiple copies of converged Ritz values (this happens even if the correspond- 
ing true singular values of A are isolated). Moreover, spurious singular values 
("ghosts") periodically appear between the already converged Ritz values (see, 
e.g. ^19] for illustration). 

There are two different approaches with respect to what should be done to 
obtain a robust method in finite arithmetics. 

3.2.1 Lanczos Bidiagonalization with no orthogonalization 

One approach is to apply the simple Lanczos process "as-is" and subsequently 
use some criterion to detect and remove spurious singular values. 

The advantage of this method is that it completely avoids any extra work 
connected with the reorthogonalization, and the storage requirements are very 
low since only few of the latest Lanczos vectors have to be remembered. The 
disadvantage is that many iterations are wasted on simply generating multiple 
copies of the large values [23 . The number of extra iterations required compared 
to that when executing the Lanczos process in exact arithmetics can be very 
large: up to six times the original number as reported in |24) . Another disdvan- 
tage is that the criterion mentioned above can be rather difficult to implement, 
since it depends on the correct choice of different thresholds and knobs. 

3.2.2 Lanczos Bidiagonalization using reorthogonalization 

A different way to deal with loss of orthogonalizy among Lanczos vectors is 
to apply some reorthogonalization scheme. The simplest one is so-callled full 
reorthogonalization (FRO) where each new Lanczos vector (or Uj+i) is 

orthogonalized against all the Lanczos vectors generated so far using, e.g., the 
modified Gram-Schmidt algorithm. 
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This approach is usually too expensive, since for the large problem the run- 
ning time needed for reorthogonalization quickly dominate the execution time 
unless the necessary number of iterations k is very small compared to the di- 



mensions of the problem (see section 3.3 for full analysis of the complexity). 



The storage requirements of FRO might also be a limiting factor since all the 
generated Lanczos vectors need to be saved. 

A number of different schemes for reducing the work associated with keeping 
the Lanczos vectors orthogonal were developed (see [3] for extensive review). 
The main idea is to estimate the level of orthogonality among the Lanczos 
vectors and perform the orthogonalization only when necessary. 

For example, the so-called partial reorthogonalization scheme (PRO) de- 
scribed in |271 1251 [Hi] uses the fact that the level of orthogonality among the 
Lanczos vectors satisfies some recurrence relation which can be derived from the 
recurrence used to generate the vectors themselves. 



3.3 The Complexity of the Lanczos Bidiagonalization 

The computational complexity of the Lanczos bidiagonalization is determined by 
the two matrix-vector multiplications, thus the overall complexity of k Lanczos 
iterations in exact arithmetics is 0(kLK) multiplications. 

The computational costs increase when one deals with the lost of orthog- 
onality due to the inexact computations. For FRO scheme the additional 
0{k 2 (L + K)) operations required for the orthogonalization quickly dominate 
the execution time, unless the necessary number of iterations k is very small 
compared to the dimension of the problem. The storage requirements of FRO 
may also be a limiting factor since all the generated Lanczos vectors need to be 
saved. 

In general, there is no way to determine in advance how many steps will 
be needed to provide the singular values of interest within a specified accuracy. 
Usually the number of steps is determined by the distribution of the singular 
numbers and the choice of the starting vector po. In the case when the singular 
values form a cluster the convergence will not occur until k, the number of 
iterations, gets very large. In such situation the method will also suffer from 
increased storage requirements: it is easy to see that they are 0(k(L + K)). 

These problems can be usually solved via limiting the dimension of the 
Krylov subspaces |3]), Q and using restarting schemes: restarting the iterations 
after a number of steps by replacing the starting vector with some "improved" 
starting vector (ideally, we would like po to be a linear combination of the right 
singular vectors of A associated with the desired singular values). Thus, if we 
limit the dimension of the Krylov subspaces to d the storage complexity drops 
to 0(d(L + K)) (surely the number of desired singular triples should be greater 
than d). See [5] for the review of the restarting schemes proposed so far and 
introduction to the thick-restarted Lanczos bidiagonalization algorithm. In the 
papers |30[ 131] different strategies of choosing the best dimension d of Krylov 
subspaces are discussed. 
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If the matrix being decomposed is sparse or structured (for example, Hankel 
in Basic SSA case or Hankel with Hankel blocks for 2D-SSA) then the compu- 
tational complexity of the bidiagonalization can be significantly reduced given 
that the efficient matrix- vector multiplication routine is available. 

4 Efficient Matrix- Vector Multiplication for 
Hankel Matrices 

In this section we will present efficient matrix-vector multiplication algorithm 
for Hankel SSA trajectory matrix. By means of the fast Fourier transform 
the cost of the multiplication drops from ordinary O(LK) multiplications to 
O {{L + K) log(£ + K)). The idea of using the FFT for computing a Han- 
kel matrix-vector product is nothing new: it was probably first described by 
Bluestein in 1968 (see [25J ). A later references are [2T1 15], 

This approach reduces the computational complexity of the singu- 
lar value decomposition of Hankel matrices from O (kLK + k 2 (L + K)j to 
O (k(L + K) log(L + K) + k 2 (L + K)) . The complexity of the worst case when 
K ~ L - N/2 drops significantly from O (kN 2 + k 2 N) to O (kN log N + k 2 N^ 

Also the discribed algorithms provide an efficient way to store the Hankel 
SSA trajectory matrix reducing the storage requirements from O(LK) to 0(L + 
K). 



4.1 Matrix Representation of Discrete Fourier Transform 
and Circulant Matrices 

The 1-d discrete Fourier transform (DFT) of a (complex) vector (/fe) fc J 1 is 
defined by: 

f l= J2e- 2 * ikl / N f k , k = 0,...,N-l. 

fc=0 

Denote by u the primitive ^V-root of unity u) — e ~ 2 ™ l / N . Introduce the DFT 
matrix Fjv: 





A 


l 


1 


\ 




i 


LU 1 ■ 








l 


UJ 2 ■ 


uj 2 ^- 1 ) 






V 


c^- 1 • 


. w (iY-i)(Ar- 


1] ) 



Then the 1-d DFT can be written in matrix form: / = ¥n f. The inverse of the 
DFT matrix is given by 

Fiv 1 - in, 



AT 



Compare with O (TV 3 ) for full SVD via Golub-Reinsh method. 
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where F^ is the adjoint (conjugate transpose) of DFT matrix Fn- Thus, the 
inverse 1-d discrete Fourier transform (IDFT) is given by 



1 N - 1 

1=0 



kl/N 



fl, k = 0, 



,7V- 1. 



The fast Fourier transform is an efficient algorithm to compute the DFT 
(and IDFT) in 0(N log N) complex multiplications instead of N 2 complex mul- 
tiplications in direct implementation of the DFT |20l [9] . 



Definition 1 An N x N matrix C of the form 



C2 



c = 



cn cn-i 
Cl c N 



C3 
C4 



C 2 \ 

C3 



CjV-1 CJV-2 Cn-3 ' ' ■ 
\ Cn Cjv-l Cn-2 ■ ■ ■ 



ci c N 

C2 Cl j 



illed a circulant matrix. 



A circulant matrix is fully specified by its first column c = (ci, C2, . . . , Cn) 
The remaining columns of C are each cyclic permutations of the vector c with 
offset equal to the column index. The last row of C is the vector c in reverse 
order, and the remaining rows are each cyclic permutations of the last row. 

The eigenvectors of a circulant matrix of a given size are the columns of the 
DFT matrix of the same size. Thus a circulant matrix is diagonalized by the 
DFT matrix: 

C = ¥ N 1 diag (Fjvc)Fjv, 

so the eigenvalues of C are given by the product F^rc. 

This factorization can be used to perform efficient matrix-vector multiplica- 
tion. Let v G and C be an N x N circulant matrix. Then 



Cv = F^ 1 diag (F Jv c)F A rt; = F^ 1 (Fjyc ¥ N v) , 

where denotes the element-wise vector multiplication. This can be computed 
efficiently using the FFT by first computing the two DFTs, F^rc and F^u, and 
then computing the IDFT F^ 1 (Fjyc ¥nv). If the matrix- vector multiplication 
is performed repeatedly with the same circulant matrix and different vectors, 
then surely the DFT F^c needs to be computed only once. 

In this way the overall computational complexity of matrix-vector product 
for the circulant matrices drops from 0(N 2 ) to O(NlogN). 
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4.2 Toeplitz and Hankel Matrices 
Definition 2 A L x K matrix of the form 



T = 



( t K 
tK+l 



tK-1 
t K 



t-2 



*2 



XpK+L-l tl<+L-2 ■■■ tL+l thj 

is called a (non-symmetric) Toeplitz matrix. 

A Toeplitz matrix is completely determined by its last column and last row, and 
thus depends on K + L — 1 parameters. The entries of T are constant down the 
diagonals parallel to the main diagonal. 

Given an algorithm for the fast matrix-vector product for circulant matrices, 
it is easy to see the algorithm for a Toeplitz matrix, since a Toeplitz matrix can 
be embedded into a circulant matrix Ck+l-i of size K + L — 1 with the first 
column equals to (t K ,t K+1 , . . . ,t K+L - 1 ,t 1 ,t 2 , ■ . ■ ,ifir_i) T : 



C 



K+L-l 



f t K 


h 


tK+L-1 ■ 


■ t K+ i\ 


tK+l 


ta 


h 


tK+2 


tK+2 


*3 


ta 


tK+3 


tK+L-1 ' ' 


t L 


t L -i ■ 


■ h 


h 


tL + l 


t L 


■ h 


t 2 


tL+2 


tL+l 


■ t 3 



K+L-l 



t-K 



(5) 



+L-2 



t K } 



where the leading L x K principal submatrix is T. This technique of embedding 
a Toeplitz matrix in a larger circulant matrix to achieve fast computation is 
widely used in preconditioning methods [6 . 

Using this embeddings, the Toeplitz matrix-vector product Tv can be rep- 
resented as follows: 

c K+L -i( n v ) = ( T :), (6) 



L-l 



and can be computed efficiently in O ((K + L) log(-ftT + L)) time and 0(K + L) 
memory using the FFT as described previously. 
Recall that an L x K matrix of form 



H 



//ii 
ha 



\h L 

is called Hankel matrix. 



h 2 
h 3 



lL+1 



h K -i 
h K 



h K \ 

tlK+l 



hK+L-2 hx+L-l) 
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A Hankel matrix is completely determined by its first column and last row, 
and thus depends on K + L — 1 parameters. The entries of H are constant along 
the antidiagonals. One can easily convert Hankel matrix into a Toeplitz one by 
reversing its columns. Indeed, define 



P = 



/0 



1 

\i o 



1\ 

1 




0/ 



the backward identity permutation matrix. Then HP is a Toeplitz matrix for 
any Hankel matrix H , and TP is a Hankel matrix for any Toeplitz matrix T. 
Note that P = P T = P^ 1 as well. Now for the product Hv of Hankel matrix 
H and vector v we have 



Hv = HPPv = (HP) Pv = Tw, 

where T is Toeplitz matrix and vector w is obtained as vector v with entries in 
reversed order. The product Tw can be computed using circulant embedding 
procedure as described in Q. 

4.3 Hankel SSA Trajectory Matrices 

Now we are ready to exploit the connection between the time series F — (fj)j =1 
under decomposition and corresponding trajectory Hankel matrix to derive the 
matrix-vector multiplication algorithm in terms of the series itself. In this way 
we will effectively skip the embedding step of the SSA algorithm and reduce the 
computational and storage complexity. 

The entries of the trajectory matrix of the series are hj = fj, j = X, . . . , N. 
The entries of Toeplitz matrix T = HP are tj = hj = fj, j = 1, . . . , N. The 
corresponding first column of the embedding circulant matrix (|5| is 

{tN-L+j = In-l+j, 1 ^ j < L; 
tj-L+l = fj-L+1 L<j^N, 

and we end with the following algorithm for matrix-vector multiplication for 
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trajectory matrix: 



Algorithm 2: Matrix- Vector Multiplication for SSA Trajectory Matrix 

Input: Time series F = (/j) =1 , window length L, vector v of length 
N-L + l. 

Output: p = Xv, where X is a trajectory matrix for F given window 
length L. 

1 c < — {f N-L+i, ■ ■ ■ , In, fi, ■ ■■ , /at-l) T 

2 c< — FFT N (c) 

3 W < (v N - L +l, ■ • ■ , Vl, 0, . . . , 0) T 

4 w < — FFT N (w) 

5 p' i — IFFT N (wQc) 

6 P < {p' Xl . . . ,p' L ) T 

where (w c) denotes element-wise vector multiplication. 

If the matrix- vector multiplication Xv is performed repeatedly with the same 
matrix X and different vectors v then steps [T] [2] should be performed only once 
at the beginning and the resulting vector c should be reused later on. 

The overall computational complexity of the matrix-vector multiplication 
is O(TVlogiV). Memory space of size O(N) is required to store precomputed 
vector c. 

Note that the matrix-vector multiplication of the transposed trajectory ma- 
trix X T can be performed using the same vector c. Indeed, the bottom-right 
K x L submatrix of circulant matrix ^ contains the Toeplitz matrix X T P and 
we can easily modify algorithm [2] as follows: 

Algorithm 3: Matrix- Vector Multiplication for transpose of SSA Trajec- 
tory Matrix 

Input: Time series F = {fj)J— v window length L, vector v of length L. 
Output: p = X T v, where A is a trajectory matrix for F given window 
length L. 

1 C < {f N-L+l, ■ ■ ■ , In, fl, ■ ■■ , fN-h) T 

2 c< — FFT N (c) 

3 w< — (0, . . . ,0,v L) . . . ,Ui) T 

4 w < — FFT N (w) 

5 p' < — IFFT N (wQc) 

6 P< — {p'l-i>--->p'n) 



5 Efficient Rank 1 Hankelization 

Let us recall the diagonal averaging procedure as described in |2.4| which trans- 
forms an arbitrary Lx K matrix Y into a Hankel one and therefore into a series 
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g k ,k = l,...,N. 



fl* = { I Ej=i Vj,k-j+l: L < k < K, (7) 

]HE+lEiU-x+l«J.*-i+i' K^k^N. 

Without loss of generality we will consider only rank 1 hankelization when 
matrix Y is elementary and can be represented as Y — auv T with vectors u and 
v of length L and K correspondingly. Then equation Q becomes 

f Ej=i u j v k-j+i, 1 < fc < L, 

f EjLi«i«fc-i+i, L<fc<X, (8) 

V. 7v-lc+I Ej=fc_jf+i «j«A-j+l> K^k^N. 



9k 



This gives a hint how the diagonal averaging can be efficiently computed. 

Indeed, consider the infinite series u' n , n G Z such that = u„ when 
1 ^ n *S L and otherwise. The series v' n are defined in the same way, so 
v ' n = v n when 1 ^ n ^ K and o 
and v' can be written as follows: 



+oo L 

j°k j-1 - ^2 "J '- 1 

j=~oo j = l 



« * Ok = ••- 1 = > »'.'• 



Ej=i "jWfe-i+i, 1 < fe < L, 

E£=i u jVk-j+i, L < k < K, (9) 

.Ei=*-K+i u 3 v fc-j+i> K ^k^N. 
Comparing the equations Q and (|9| we deduce that 

9k = Cfe « * Ok ' 

where the constants Ck are known in advance. 

The linear convolution u * v can be defined in terms of the circular convolu- 
tion, as follows. Pad the vectors u and v with the zeroes up to length L + K—l. 
Then the linear convolution of the original vectors u and v is the same as the 
circular convolution of the extended series of length L + K — 1 , namely 

L+K-l 

(u*v) k = ^ U j V k-j+l, 
2=1 

where k — j + 1 is evaluated mod (L + K—l) . The resulting circular convolution 
can be calculated effeciently via the FFT |I] : 

(u * v) k = IFFT N (FFT n {v!) FFT N (v')) . 

Here N = L + K—l, and u' , v' denote the zero-padded vectors u and v 
correspondingly. 
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And we end with the following algorithm for the hankelization via convolu- 
tion: 

Algorithm 4: Rank 1 Hankelization via Linear Convolution 
Input: Vector u of length L, vector v of length K, singular value a. 
Output: Time series G — (gj)j =1 corresponding to the matrix Y = auv T 
after hankelization. 

1 u' i — (tii, ■ ■ ■ , u L , 0, . . . , 0) T G R N 

2 u i — FFT N (u') 

s «'«— (v 1: ...,v K ,0,...,0) T eR N 
4»< — FFT N {v') 

5 g' i — IFFTn (v Q u) 

6 w< — (1, . . . , L, L, . . . , L, L, . . . , 1) G R N 

7 g < — a(wQg') 

The computational complexity of this algorithm is O(AlogiV) multiplications 
versus 0(LK) for naive direct approach. Basically this means that the worst 
case complexity of the diagonal averaging drops from 0{N 2 ) to O(NlogN). 

6 Implementation Comparison 

The algorithms presented in this paper have much better asymptotic complexity 
than standard ones, but obviously might have much bigger overhead and thus 
would not be suitable for the real- world problems unless series length and/or 
window length becomes really big. 

The mentioned algorithms were implemented by the means of the Rssa[^ 
package for the R system for statistical computing |25_ . All the comparions were 
carried out on the 4 core AMD Opteron 2210 workstation running Linux. Where 
possible we tend to use state-of-the-art implementations of various computa- 
tional algorithms (e.g. highly optimized ATLAS [55] and ACML implementations 
of BLAS and LAPACK routines). We used R version 2.8.0 throughout the tests. 

We will compare the performance of such key components of the fast SSA 
implementation as rank 1 diagonal averaging and the Hankel matrix-vector mul- 
tiplication. The performance of the methods which use the "full" SSA, namely 
the series bootstrap confidence interval construction and series structure change 
detection, will be considered as well. 

6.1 Fast Hankel Matrix- Vector Product 

Fast Hankel matrix-vector multiplication is the key component for the Lanczos 
bidiagonalization. The algorithm outlined in the section |4.3| was implemented 
in two different ways: one using the core R FFT implementation and another 
one using FFT from FFTW library [Sj. We selected window size equal to the 
half of the series length since this corresponds to the worst case both in terms 

2 The package will be submitted to CRAN soon, yet it can be obtained from author. 



15 



of computational and storage complexity. All initialization times (for example, 
Hankel matrix construction or circulant precomputation times) were excluded 
from the timings since they are performed only once in the beginning. So, we 
compare the performance of the generic DGEMV [1 matrix- vector multication 
routine versus the performance of the special routine exploiting the Hankel 
structure of the matrix. 



• ATLAS DGEMV 

♦ R fast product 

o FFTW fast product 

* ACML DGEMV 




o o ° o- 



0.1 - .A 



Series length, N 



Figure 1: Hankel Matrix- Vector Multiplication Comparison 

The running times for 100 matrix-vector products with different series 
lengths are presented on the figure [T] Note that we had to use logarithmic scale 
for the y-axis in order to outline the difference between the generic routines and 
our implementations properly. One can easily see that all the routines indeed 
possess the expected asymptotic complexity behaviour. Also special routines 
are almost always faster than generic ones, thus we recommend to use them for 
all window sizes (the overhead for small series lengths can be neglected). 

6.2 Rank 1 Hankelization 

The data access during straightforward implementation of rank 1 hankelization 
is not so regular, thus pure R implementation turned out to be slow and we 
haven't considered it at all. In fact, two implementations are under comparison: 
straightforward C implementation and FFT-based one as described in section 
[5} For the latter FFTW library was used. The window size was equal to the 
half of the series length to outline the worst possible case. 



1G 



• Direct 

o FFT-based 



Series length, N 



Figure 2: Rank 1 Hankelization Implementation Comparison 



The results are shown on the figure [2] As before, logarithmic scale was used 
for the y-axis. From this fugure one can see that the computational complexity 
of the direct implementation of hankelization quickly dominates the overhead 
of the more complex FFT-based one and thus the latter can be readily used for 
all non-trivial series lengths. 

6.3 Bootstrap Confidence Intervals Construction 

For the comparison of the implementations of the whole SSA algorithm the 
problem of bootstrap confidence intervals construction was selected. It can be 
described as follows: consider we have the series Fn of finite rank d. Form the 
series F' N = F N + ctem , where £j denotes the uncorrelated white noise (thus the 
series F' N are of full rank) . Fix the window size L < N and let Gn denote the 
series reconstructed from first d eigentriples of series F' N . This way G n is a 
series of rank d as well. 

Since the original series Fn are considered known, we can study large-sample 
properties of the estimate Gn- bias, variance and mean square error. Having the 
variance estimate one can, for example, construct bootstrap confidence intervals 
for the mean of the reconstructed series Gn- 

Such simulations arc usually quite time-consuming since we need to per- 
form few dozen reconstructions for the given length N in order to estimate the 
variance. 
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For the simulation experiments we consider the series Fn = (fi, ■ • ■ ,/jv) of 
rank 5 with 



/2tt n\ 


/2tt n \ 




+ 2.5 sin 


I 13 7V J 


I 37 N 



and <T = 5. The series -Fiooo (white) and F{ 000 (black) are shown on the figure 

12 
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Figure 3: The Series Fiooo an d i^ooo- 

We compare 3 different SSA implementations: one uses the full SVD via 
DGESDD routine from ATLAS library, another one uses the full SVD via the same 
routine from ACML library and third uses fast Hankel truncated SVD and FFT- 
based rank-1 hankelization. Again, FFTW library was used to perform the FFT. 
Window size for SSA was always fixed at half of the series length. 

The running times for the rank 5 reconstruction are presented on the figure 
|4j Note that logarithmic scale was used for the y-axis. From it we can easily 
see that the running times for the SSA implementation via fast Hankel SVD 
and FFT-based rank-1 hankelization are dramatically smaller compared to the 
running times of other implementations for any non-trivial series length. 



6.4 Detection of Structural Changes 

The SSA can be used to detect the structural changes in the series. The main 
instrument here is so-called heterogeneity matrix (H-matrix). We will briefly 
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Series length, N 



Figure 4: Reconstruction Time Comparison. 



describe the algorithm for construction of such matrix, discuss the computa- 
tion complexity of the contruction and compare the performances of different 
implementations . 

Exhaustive exposition of the detection of structural changes by the means 
of SSA can be found in |14j . 

Consider two time series and F^K Let N\ and N2 denote their lengths 
respectively. Take integer L < min (JVi - 1, N 2 ). Let Uf \ j = 1, . . . ,L denote 
the eigenvectors of the SVD of the trajectory matrix of the series F^\ Fix 
I to be a subset of {1, ...,L} and denote = span (Z7j, i € I). Denote by 
X[ 2 \ X [ K ] 2 (K 2 = N 2 -L + 1) the L-lagged vectors of the series . 

Now we are ready to introduce the heterogeneity index |14j which measures 
the descrepancy between the series F^ and the structure of the series F^ (this 
structure is described by the subspace C^): 



9(F^,FV) = ^± ^ \ > , (10) 



yK 2 



(2) 2 



where dist (X, C) denotes the Euclidian distance between the vector X and 
subspace C. One can easily see that < g < 1. 

Note that since the vectors Ut form the orthonormal basis of the subspace 
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£W equation ( 10 1 can be rewritten as 



7 ( F (l) iF (2) ) = 1 



V^2 
2^ = 1 



2 



Having this descrepancy measure for two arbitrary series in the hand one 
can obviously construct the method of detection of structural changes in the 
single time series. Indeed, it is sufficient to calculate the heterogeneity index g 
for different pairs of subseries of the series F and study the obtained results. 

The heterogeneity matrix (H-matrix) |14| provides a consistent view over 
the structural descrepancy between different parts of the series. Denote by Fij 
the subseries of F of the form: Fi j = (f%, ■ ■ ■ , fj)- Fix two integers B > L 
and T > L. These integers will denote the lengths of base and test subseries 
correspondingly. Introduce the H-matrix Gb,t with the elements gij as follows: 

9ij = 9{Fi,i+B,Fj,j+T), 

for i = 1 , . . . , N — B + 1 and j = 1 , . . . , N — T + 1 . In simple words we split 
the series F into subseries of lengths B and T and calculate the heteregoneity 
index between all possible pairs of the subseries. 

The computation complexity of the calculation of the H-matrix is formed by 
the complexity of the SVDs for series i^i+B and complexity of calculation of 



all heteregeneity indices g^ as defined by equation (10 1. 

The worst-case complexity of the SVD for series -F^i+B corresponds to the 
case when L ~ B/2. Then the single decomposition costs 0(B 3 ) for full SVD via 
Golub-Reinsh method and 0(kB log B + k 2 B) via fast Hankel SVD as presented 
in sections [3] and [4] Here k denotes the number of elements in the index set 
I. Since we need to make N — B + 1 decompositions a total we end with the 
following complexities of the worst case (when B ~ N/2): 0(N A ) for full SVD 
and 0(kN 2 \ogN + k 2 N 2 ) for fast Hankel SVD. 

Having all the desired eigenvectors Ui for all the subseries Fi t i + B in hand one 
can calculate the H-matrix G b .t in 0{kL(T- L + l)(N-T+ 1) + (N- T+ 1)L) 
multiplications. The worst case corresponds to L ~ T/2 and T ~ N/2 yielding 
the 0(kN 3 ) complexity for this step. We should note that this complexity can 
be further reduced with the help of fast Hankel matrix- vector multiplication, 
but for the sake of simplicity we won't present this result here. 

Summing the complexities we end with 0(N 4 ) multiplications for H-matrix 
computation via full SVD and 0(kN 3 + k 2 N 2 ) via fast Hankel SVD. 

For the implementation comparison we consider the series Fn = (fi, ■ ■ ■ , /jv) 
of the form 



' 2 »+0.01e„, n<Q, 
io 5 -n) -I- 0.01e„, n>Q, 



fn= c )z\ :r :r cu) 



where s n denotes the uncorrelated white noise. 
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The typical H-matrix for such series is shown on the figure [5] (compare with 
figure 3.3 in [2]). The parameters used for this matrix are N — 400, Q = 200, 
B = 100, T = 100, L = 50 and I = {1, 2}. 

To save some computational time we compared not the worst case complex- 
ities, but some intermediate situation. The parameters used were Q = N/2, 
B = T = N/4, L = B/2 and I = {1,2}. For full SVD DGESDD routine from 
ATLAS library [55] was used (as it was shown below, it turned out to be the 
fastest full SVD implementation available for our platform). 

The obtained results are shown on the figure [6] As before, logarithmic scale 
was used for the y-axis. Notice that the implementation of series structure 
change detection using fast Hankel SVD is more than 10 times faster even on 
series of moderate length. 



7 Real- World Time Series 

Fast SSA implementation allows us to use large window sizes during the decom- 
position, which was not available before. This is a crucial point in the achieving 
of asymptotic separability between different components of the series [2] . 
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Series length, N 



Figure 6: Structure Change Detection Comparison 



7.1 Trend Extraction for HadCET dataset 

Hadley Centre Central England Temperature (HadCET) dataset [22_ is the 
longest instrumental record of temperature in the world. The mean daily data 
begins in 1772 and is representative of a roughly triangular area of the United 
Kingdom enclosed by Lancashire, London and Bristol. Total series length is 
86867 (up to October, 2009). 

We applied SSA with the window length of 43433 to obtain the trend and few 
seasonal components. The decomposition took 22.5 seconds and 50 eigentriples 
were computed. First eigentriple was obtained as a representative for the trend 
of the series. The resulting trend on top of 5-year running mean temperature 
anomaly relative to 1951-1980 is shown on figure [7] (compare with figure 1 in 

El)- 



7.2 Quebec Birth Rate Series 

The series obtained from |18| contains the number of daily births in Quebec, 
Canada, from January 1, 1977 to December 31, 1990. They were discussed in 
|14| and it was shown that in addition to a smooth trend, two cycles of different 
ranges: the one-year periodicity and the one-week periodicity can be extracted 
from the series. 

However, the extraction of the year periodicity was incomplete since the 
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Figure 7: Hadley CET Series and Extracted Trend 



series turned out to be too smooth and do not uncover the complicated structure 
of the year periodicity (for example, it does not contain the birth rate drop 
during the Christmas). Total series length is 5113, which suggests the window 
length of 2556 to be used to achieve the optimal separability. This seems to have 
been computationaly expensive at the time of book written (even nowadays the 
full SVD of such matrix using LAPACK takes several minutes and requires decent 
amount of RAM). 

We apply the sequential SSA approach: first we did the decomposition using 
window size of 365. This allows us to extract the trend and week periodicity in 
the same way as described in [2]. After that we performed the decomposition 
of the residual series using the window length of 2556. 100 eigentriples were 
computed in 3.2 seconds. The decomposition cleanly revealed the complicated 
structure of the year periodicity: we identified many components corresponding 
to the periods of year, half-year, one third of the year, etc. In the end, the 
eigentriples 1—58 excluding 22—24 were used during the reconstruction. The end 
result containing trend and year periodicity is shown on the figure [8] (compare 
with figure 1.9 in |14|). It explains the complicated series structure much better. 
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Year 

Figure 8: Quebec Birth Rate: Trend and Year Periodicity 

8 Conclusion 

The stages of the basic SSA algorithm were reviewed and their computational 
and space complexities were discussed. It was shown that the worst-case com- 
putational complexity of 0(N 3 ) can be reduced to 0(kN log N + k 2 N), where 
k is the number of desired eigentriples. 

The implementation comparison was presented as well showing the superi- 
ority of the outlined algorithms in the terms of running time. This speed im- 
provement enables the use of the SSA algorithm for the decomposition of quite 
long time series using the optimal for asymptotic separability window size. 
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